In [ ]:
%matplotlib inline

Probability and Statistics Review

Based on "Machine Learning: A Probabilistic Perspective" by Kevin P. Murphy (Chapter 2).

What is probability?

  • At least two different interpretations:
    • Frequentist: probabilities are long-run frequencies of events
    • Bayesian: probabilities are used to quantify our uncertainty

One advantage of the Bayesian interpretation is that it can be used to model events that do not have long-run frequencies.

A brief review of probability theory

Discrete random variables

$p(A)$ denotes the probability that the event $A$ is true

  • $0 \leq p(A) \leq 1$

We write $p(\bar{A})$ to denote the probability of the event not $A$

  • $p(\bar{A}) = 1 - p(A)$

We can extend the notion of binary events by defining a discrete random variable $X$ which can take on any value from a finite or countably infinite set $\mathcal{X}$. We denote the probability of the event that $X = x$ by $p(X = x)$ or just $p(x)$ for short.

  • $0 \leq p(x) \leq 1$
  • $\sum_{x \in \mathcal{X}} p(x) = 1$

Let's look at some discrete distributions:


In [ ]:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2)

ax[0].bar([1, 2, 3, 4],[0.25, 0.25, 0.25, 0.25], align='center')
ax[0].set_ylim([0, 1])
_ = ax[0].set_xticks([1, 2, 3, 4])
ax[0].set_title('Uniform distribution')

ax[1].bar([1, 2, 3, 4],[0, 1.0, 0, 0], align='center')
ax[1].set_ylim([0, 1])
_ = ax[1].set_xticks([1, 2, 3, 4])
ax[1].set_title('Degenerate distribution')

Fundamental rules

Probability of a union of two events

Given two events, $A$ and $B$, we define the probability of $A$ or $B$ as

$$ \begin{align} p(A \lor B) &= p(A) + p(B) - p(A \land B) \\ &= p(A) + p(B) & \text{if $A$ and $B$ are mutually exclusive} \end{align} $$

Joint probabilities

We define the probability of the joint event $A$ and $B$ as

$$ p(A,B) = p(A \land B) = p(A|B)p(B) $$

Given a joint distribution on two events $p(A,B)$, we define the marginal distribution as

$$ p(A) = \sum_b p(A, B=b) = \sum_b p(A|B=b)p(B=b) $$

Conditional probability

We define the conditional probability of event $A$, given that event $B$ is true, as

$$ \begin{align} p(A|B) &= \frac{p(A,B)}{p(B)} & \text{if $p(B) > 0$} \end{align} $$

Bayes' rule

Manipulating the basic definition of conditional probability gives us one of the most important formulas in probability theory

$$p(X=x|Y=y) = \frac{p(X=x,Y=y)}{p(Y=y)} = \frac{p(Y=y|X=x)p(X=x)}{\sum_{x'}p(Y=y|X=x')p(X=x')}$$
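
As a quick numerical sketch (the numbers below are made up, purely for illustration), let $X$ be a binary disease indicator and $Y$ a binary test result, and compute the posterior $p(X|Y=1)$:


In [ ]:
# Bayes' rule with made-up numbers: X = disease status, Y = test result
prior = np.array([0.99, 0.01])             # p(X=0), p(X=1)
likelihood = np.array([0.05, 0.90])        # p(Y=1|X=0), p(Y=1|X=1)

evidence = np.sum(likelihood * prior)      # p(Y=1), the denominator above
posterior = likelihood * prior / evidence  # p(X=x|Y=1) for x = 0, 1
print(posterior)                           # most of the mass is still on X=0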

Independence and conditional independence

We say $X$ and $Y$ are unconditionally independent or marginally independent, denoted $X \perp Y$, if we can represent the joint as the product of the two marginals, i.e.,

$$X \perp Y \Longleftrightarrow p(X,Y) = p(X)p(Y)$$

In general, we say a set of variables is mutually independent if the joint can be written as a product of marginals.

We say $X$ and $Y$ are conditionally independent given $Z$ iff the conditional joint can be written as a product of conditional marginals:

$$X \perp Y|Z \Longleftrightarrow p(X,Y|Z)=p(X|Z)p(Y|Z)$$

CI assumptions allow us to build large probabilistic models from small pieces.
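
A minimal numerical sketch of this (with arbitrary made-up tables): if we build a joint from $p(Z)$, $p(X|Z)$ and $p(Y|Z)$, then conditioning on $Z$ factorizes the joint over $X$ and $Y$:


In [ ]:
# Construct p(X, Y, Z) = p(Z) p(X|Z) p(Y|Z) from small arbitrary tables,
# then check that p(X, Y | Z=z) = p(X|Z=z) p(Y|Z=z) for each z
pz = np.array([0.3, 0.7])                   # p(Z)
px_z = np.array([[0.2, 0.8], [0.6, 0.4]])   # p(X|Z), row z
py_z = np.array([[0.5, 0.5], [0.1, 0.9]])   # p(Y|Z), row z

joint = np.einsum('z,zx,zy->xyz', pz, px_z, py_z)   # p(X, Y, Z)

for z in range(2):
    p_xy_given_z = joint[:, :, z] / joint[:, :, z].sum()
    print(np.allclose(p_xy_given_z, np.outer(px_z[z], py_z[z])))  # True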

Continuous random variables

Suppose $X$ is some uncertain continuous quantity. The probability that $X$ lies in any interval $a \leq X \leq b$ can be computed as follows. Define the events $A = (X \leq a), B = (X \leq b)$ and $W = (a < X \leq b)$. We have that $B = A \vee W$, and since $A$ and $W$ are mutually exclusive, the sum rule gives

$$p(B) = p(A) + p(W)$$

and hence

$$p(W) = p(B) - p(A)$$

Define the function $F(q) \triangleq p(X \leq q)$. This is called the cumulative distribution function or cdf of $X$. This is a monotonically non-decreasing function.


In [ ]:
# CDF of Gaussian N(0,1)
import scipy.stats as stats
f = lambda x : stats.norm.cdf(x, 0, 1)
x = np.arange(-3, 3, 0.1)
y = f(x)

plt.plot(x, y, 'b')
plt.title('CDF')

Using the above notation, we have $$p(a < X \leq b) = F(b) - F(a)$$

Now define $f(x) = \frac{d}{dx} F(x)$ (we assume this derivative exists); this is called a probability density function or pdf. Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:

$$P(a < X \leq b) = \int_a^b f(x) dx$$
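
We can check this relationship numerically for the standard normal, integrating the pdf over an interval and comparing against the cdf (a quick sanity check, with an arbitrarily chosen interval):


In [ ]:
# Check P(a < X <= b) = F(b) - F(a) for the standard normal
from scipy.integrate import quad

a, b = -1.0, 2.0
integral, _ = quad(lambda x: stats.norm.pdf(x, 0, 1), a, b)
print(integral, stats.norm.cdf(b, 0, 1) - stats.norm.cdf(a, 0, 1))  # agree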

In [ ]:
# PDF of Gaussian N(0,1)
# the two shaded tails together contain 0.05 of the mass;
# the central (unshaded) interval is approximately mu +/- 2*sigma
f = lambda x : stats.norm.pdf(x, 0, 1)
x = np.arange(-4, 4, 0.1)
y = f(x)

plt.plot(x, y, 'b')
l_x = np.arange(-4, -1.96, 0.01)
plt.fill_between(l_x, f(l_x))
u_x = np.arange(1.96, 4, 0.01)
plt.fill_between(u_x, f(u_x))

plt.title('PDF')

We require $p(x) \geq 0$, but it is possible to have $p(x) > 1$ at a given point $x$, so long as the density integrates to 1.


In [ ]:
# Example of p(x) > 1, Uniform distribution on (0, 0.5)
f = lambda x: stats.uniform.pdf(x, 0, 0.5)
x = np.arange(-0.5, 1, 0.01)
y = f(x)

plt.plot(x, y, 'b')
plt.title('Uniform PDF')

Mean and variance

The most familiar property of a distribution is its mean, or expected value, denoted by $\mu$. For discrete rv's, it is defined as $\mathbb{E}[X] \triangleq \sum_{x \in \mathcal{X}} x p(x)$, and for continuous rv's, it is defined as $\mathbb{E}[X] \triangleq \int_{\mathcal{X}} x p(x) dx$.

The variance is a measure of the "spread" of a distribution, denoted by $\sigma^2$. This is defined as follows:

$$ \begin{align} \text{var}[X] & \triangleq \mathbb{E}\left[ \left( X - \mu\right)^2 \right] = \int \left( x - \mu \right) ^2 p(x) dx \\ &= \int x^2 p(x)dx + \mu^2 \int p(x) dx - 2 \mu \int x p(x) dx = \mathbb{E}[X^2] - \mu^2 \end{align} $$

from which we derive the useful result

$$\mathbb{E}[X^2] = \mu^2 + \sigma^2$$

The standard deviation is defined as

$$\text{std}[X] \triangleq \sqrt{\text{var}[X]}$$
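
As a sketch, we can verify the identity $\mathbb{E}[X^2] = \mu^2 + \sigma^2$ with Monte Carlo samples from a standard normal (any distribution would do):


In [ ]:
# Monte Carlo check of E[X^2] = mu^2 + sigma^2, using samples from N(0, 1)
samples = stats.norm.rvs(loc=0, scale=1, size=100000, random_state=0)
mu_hat, var_hat = samples.mean(), samples.var()
print(mu_hat, var_hat)                                # close to 0 and 1
print(np.mean(samples ** 2), mu_hat ** 2 + var_hat)   # the two sides agree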

Some common discrete distributions

The binomial and Bernoulli distributions

Suppose we toss a coin $n$ times. Let $X \in \{0, \ldots, n\}$ be the number of heads. If the probability of heads is $\theta$, then we say $X$ has a binomial distribution, written as $X \sim \text{Bin}(n, \theta)$. The probability mass function (pmf) is given by

$$\text{Bin}(k|n,\theta) \triangleq {n\choose k} \theta^k(1 - \theta)^{n-k}$$

where $$ {n\choose k} \triangleq \frac{n!}{(n-k)!k!}$$

is the number of ways to choose $k$ items from $n$.

This distribution has a mean of $n\theta$ and a variance of $n\theta(1-\theta)$.


In [ ]:
fig, ax = plt.subplots(1, 2)

x = np.arange(11)

f = lambda x : stats.binom.pmf(x, 10, 0.25)

ax[0].bar(x, f(x), align='center')
#ax[0].set_ylim([0, 1])
_ = ax[0].set_xticks(x)
ax[0].set_title(r'$\theta$ = 0.25')

f = lambda x : stats.binom.pmf(x, 10, 0.9)
    
ax[1].bar(x, f(x), align='center')
#ax[1].set_ylim([0, 1])
_ = ax[1].set_xticks(x)
ax[1].set_title(r'$\theta$ = 0.9')
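
We can also check the stated mean and variance against scipy's implementation (a quick sanity check, with the same $n$ and $\theta$ as the left panel above):


In [ ]:
# Mean n*theta and variance n*theta*(1-theta) of the binomial, via scipy
n, theta = 10, 0.25
mean, var = stats.binom.stats(n, theta, moments='mv')
print(mean, n * theta)               # both 2.5
print(var, n * theta * (1 - theta))  # both 1.875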

Now suppose we toss a coin only once. Let $X \in \{0,1\}$ be a binary random variable, with probability of "success" or "heads" of $\theta$. We say that $X$ has a Bernoulli distribution. This is written as $X \sim \text{Ber}(\theta)$, where the pmf is defined as

$$\text{Ber}(x|\theta) = \theta^{\mathbb{I}(x=1)}(1-\theta)^{\mathbb{I}(x=0)}$$

In other words,

$$ \text{Ber}(x|\theta) = \left\{ \begin{array}{rl} \theta &\mbox{ if $x=1$} \\ 1 - \theta &\mbox{ if $x=0$} \end{array} \right. $$

This is obviously just a special case of a Binomial distribution with $n=1$.
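
A one-line check of this, using scipy with an arbitrarily chosen $\theta$:


In [ ]:
# Ber(x|theta) agrees with Bin(x|1, theta) on x in {0, 1}
theta = 0.3
for x in [0, 1]:
    print(stats.bernoulli.pmf(x, theta), stats.binom.pmf(x, 1, theta))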

The multinomial and multinoulli distribution

To model the outcomes of tossing a $K$-sided die, we can use the multinomial distribution. This is defined as follows: let $\mathbf{x}=(x_1, \ldots, x_K)$ be a random vector, where $x_j$ is the number of times side $j$ of the die occurs. Then $\mathbf{x}$ has the following pmf:

$$\text{Mu}(\mathbf{x}|n, \mathbf{\theta}) \triangleq {n \choose x_1,\ldots,x_K} \prod_{j=1}^K \theta_j^{x_j}$$

where $\theta_j$ is the probability that side $j$ shows up, and

$${n \choose x_1,\ldots,x_K} \triangleq \frac{n!}{x_1!x_2! \ldots x_K!}$$

is the multinomial coefficient (the number of ways to divide a set of size $n=\sum_{k=1}^K x_k$ into subsets with sizes $x_1$ up to $x_K$).

Now suppose $n=1$. This is like rolling a $K$-sided die once, so $\mathbf{x}$ will be a vector of 0s and 1s (a bit vector), in which only one bit can be turned on. Specifically, if the die shows face $k$, then the $k$'th bit will be on. In this case, we can think of $x$ as being a scalar categorical random variable with $K$ states (values), and $\mathbf{x}$ is its dummy encoding, that is, $\mathbf{x} = \left[\mathbb{I}(x=1),\ldots,\mathbb{I}(x=K)\right]$. For example, if $K=3$, we encode the states 1, 2, and 3 as $(1, 0, 0), (0, 1, 0)$, and $(0, 0, 1)$. This is also called one-hot encoding. In this case, the pmf becomes

$$\text{Mu}(\mathbf{x}|1, \mathbf{\theta}) = \prod_{j=1}^K \theta_j^{\mathbb{I}(x_j=1)}$$

This very common special case is known as a categorical or discrete distribution (Kevin Murphy's text adopts the term multinoulli distribution by analogy with the binomial/Bernoulli distinction). We will use the following notation

$$\text{Cat}(x|\mathbf{\theta}) \triangleq \text{Mu}(\mathbf{x}|1, \mathbf{\theta})$$
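
A short sketch (with an arbitrarily chosen $\boldsymbol\theta$) evaluating the multinomial pmf via scipy and drawing one-hot samples for the $n=1$ categorical case:


In [ ]:
# Multinomial pmf for a 3-sided "die", and one-hot draws for the n=1 case
theta = [0.2, 0.3, 0.5]

print(stats.multinomial.pmf([2, 3, 5], n=10, p=theta))  # Mu(x | n=10, theta)

np.random.seed(1)
print(np.random.multinomial(1, theta, size=5))          # five one-hot samples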

Some common continuous distributions

Gaussian (normal) distribution

The most widely used distribution in statistics and machine learning is the Gaussian or normal distribution. Its pdf is given by

$$\mathcal{N}(x|\mu, \sigma^2) \triangleq \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$$

where $\mu = \mathbb{E}[X]$ is the mean (and mode), and $\sigma^2 = \text{var}[X]$ is the variance. $\frac{1}{\sqrt{2 \pi \sigma^2}}$ is the normalization constant needed to ensure the density integrates to 1.

We write $X \sim \mathcal{N}(\mu, \sigma^2)$ to denote that $p(X=x) = \mathcal{N}(x|\mu, \sigma^2)$. If $X \sim \mathcal{N}(0,1)$, we say $X$ follows a standard normal distribution.

We will sometimes talk about the precision of a Gaussian, by which we mean the inverse variance: $\lambda = 1/\sigma^2$.
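
As a sanity check (with arbitrary $\mu$ and $\sigma^2$ chosen for illustration), the pdf formula above matches scipy's implementation:


In [ ]:
# Compare the Gaussian pdf formula with scipy.stats.norm
mu, sigma2 = 1.0, 2.0
x = np.linspace(-4, 6, 11)
manual = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
print(np.allclose(manual, stats.norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))))  # True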

The Gaussian distribution is the most widely used distribution in statistics. Why?

  • It has two parameters that are easy to interpret
  • The central limit theorem tells us that sums of independent random variables have an approximately Gaussian distribution, making it a good fit for modeling residual errors or "noise"
  • The Gaussian distribution makes the fewest assumptions (i.e., it has maximum entropy among distributions with a given mean and variance), which makes it a good default choice in many cases
  • It has a simple mathematical form, which results in easy-to-implement but often highly effective methods

The Student $t$ distribution

One problem with the Gaussian distribution is that it is sensitive to outliers, since the log-probability only decays quadratically with distance from the centre. A more robust distribution is the Student $t$ distribution. Its pdf is as follows

$$\mathcal{T}(x|\mu, \sigma^2, \nu) \propto \left[ 1 + \frac{1}{\nu} \left( \frac{x-\mu}{\sigma}\right)^2\right]^{-\left(\frac{\nu + 1}{2}\right)}$$

where $\mu$ is the mean, $\sigma^2>0$ is the scale parameter, and $\nu > 0$ is called the degrees of freedom.

The distribution has the following properties:

mean = $\mu$, mode = $\mu$, var = $\frac{\nu \sigma^2}{(\nu - 2)}$

The variance is only defined if $\nu > 2$. The mean is only defined if $\nu > 1$. It is common to use $\nu = 4$, which gives good performance in a range of problems. For $\nu \gg 5$, the Student distribution rapidly approaches a Gaussian distribution and loses its robustness properties.
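
A quick numerical illustration of the last point: as $\nu$ grows, the Student pdf gets very close to the standard normal pdf (degrees of freedom values below chosen arbitrarily):


In [ ]:
# Max absolute gap between Student t (various df) and N(0, 1) pdfs
x = np.arange(-4, 4, 0.01)
for nu in [1, 4, 30, 100]:
    gap = np.max(np.abs(stats.t.pdf(x, df=nu) - stats.norm.pdf(x)))
    print(nu, gap)   # the gap shrinks as nu grows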

The Laplace distribution

Another distribution with heavy tails is the Laplace distribution, also known as the double-sided exponential distribution. It has the following pdf:

$$\text{Lap}(x|\mu,b) \triangleq \frac{1}{2b} \exp \left( - \frac{|x - \mu|}{b}\right)$$

Here $\mu$ is a location parameter and $b>0$ is a scale parameter. This distribution has the following properties:

mean = $\mu$, mode = $\mu$, var = $2b^2$

Not only does it have heavier tails than the Gaussian, it also puts more probability density at 0. This property is a useful way to encourage sparsity in a model, as we will see later.
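
The stated variance can be checked directly with scipy (for an arbitrarily chosen scale $b$):


In [ ]:
# Variance of Laplace(mu, b) is 2 * b**2
b = 1.5
print(stats.laplace.var(loc=0, scale=b), 2 * b ** 2)   # both 4.5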


In [ ]:
# Show Gaussian, Student, Laplace pdfs and log pdfs
fig, ax = plt.subplots(2, 1, sharex=True)

g = lambda x : stats.norm.pdf(x, loc=0, scale=1)
t = lambda x : stats.t.pdf(x, df=1, loc=0, scale=1)
l = lambda x : stats.laplace.pdf(x, loc=0, scale=1/np.sqrt(2))

x = np.arange(-4, 4, 0.1)


ax[0].plot(x, g(x), 'b-', label='Gaussian')
ax[0].plot(x, t(x), 'r.', label='Student')
ax[0].plot(x, l(x), 'g--', label='Laplace')

ax[0].legend(loc='best')
ax[0].set_title('pdfs')

ax[1].plot(x, np.log(g(x)), 'b-', label='Gaussian')
ax[1].plot(x, np.log(t(x)), 'r.', label='Student')
ax[1].plot(x, np.log(l(x)), 'g--', label='Laplace')
ax[1].set_title('log pdfs')

In [ ]:
# Demonstrate fitting Gaussian, Student, and Laplace to data
# with and without outliers
n = 30  # n data points
np.random.seed(0)
data = np.random.randn(n)

outliers = np.array([8, 8.75, 9.5])
nn = len(outliers)
nbins = 7

# fit each of the models to the data (no outliers)
model_g = stats.norm.fit(data)
model_t = stats.t.fit(data)
model_l = stats.laplace.fit(data)

fig, ax = plt.subplots(2, 1, sharex=True)

x = np.arange(-10, 10, 0.1)

g = lambda x : stats.norm.pdf(x, loc=model_g[0], scale=model_g[1])
t = lambda x : stats.t.pdf(x, df=model_t[0], loc=model_t[1], scale=model_t[2])
l = lambda x : stats.laplace.pdf(x, loc=model_l[0], scale=model_l[1])

ax[0].hist(data, bins=25, range=(-10, 10),
           density=True, alpha=0.25, facecolor='gray')
ax[0].plot(x, g(x), 'b-', label='Gaussian')
ax[0].plot(x, t(x), 'r.', label='Student')
ax[0].plot(x, l(x), 'g--', label='Laplace')

ax[0].legend(loc='best')
ax[0].set_title('no outliers')

# fit each of the models to the data (with outliers)
newdata = np.r_[data, outliers]  # row concatenation
model_g = stats.norm.fit(newdata)
model_t = stats.t.fit(newdata)
model_l = stats.laplace.fit(newdata)


g = lambda x : stats.norm.pdf(x, loc=model_g[0], scale=model_g[1])
t = lambda x : stats.t.pdf(x, df=model_t[0], loc=model_t[1], scale=model_t[2])
l = lambda x : stats.laplace.pdf(x, loc=model_l[0], scale=model_l[1])

ax[1].hist(newdata, bins=25, range=(-10, 10),
           density=True, alpha=0.25, facecolor='gray')
ax[1].plot(x, g(x), 'b-', label='Gaussian')
ax[1].plot(x, t(x), 'r.', label='Student')
ax[1].plot(x, l(x), 'g--', label='Laplace')


ax[1].set_title('with outliers')

Joint probability distributions

A joint probability distribution has the form $p(x_1,\ldots,x_D)$ for a set of $D>1$ variables, and models the (stochastic) relationships between the variables. If all the variables are discrete, we can represent the joint distribution as a big multi-dimensional array, with one variable per dimension. However, the number of parameters needed to define such a model is $O(K^D)$, where $K$ is the number of states for each variable.

We can define high dimensional joint distributions using fewer parameters by making conditional independence assumptions. In the case of continuous distributions, an alternative approach is to restrict the form of the pdf to certain functional forms, some of which are examined below.
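
A small sketch of a discrete joint as a multi-dimensional array, with marginals obtained by summing out variables (the entries are random numbers, just to show the mechanics):


In [ ]:
# A joint p(x1, x2, x3) over three binary variables as a 2 x 2 x 2 array
np.random.seed(2)
joint = np.random.rand(2, 2, 2)
joint /= joint.sum()              # normalize so the entries sum to 1

p_x1 = joint.sum(axis=(1, 2))     # marginal p(x1): sum out x2 and x3
p_x1x2 = joint.sum(axis=2)        # marginal p(x1, x2)
print(p_x1, p_x1.sum())           # the marginal also sums to 1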

Covariance and correlation

The covariance between two rv's $X$ and $Y$ measures the degree to which $X$ and $Y$ are (linearly) related. Covariance is defined as

$$\text{cov}[X,Y] \triangleq \mathbb{E}\left[\left(X - \mathbb{E}[X]\right)\left(Y - \mathbb{E}[Y]\right)\right]=\mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$

If $\mathbf{x}$ is a $d$-dimensional random vector, its covariance matrix is defined to be the following symmetric, positive semi-definite matrix:

$$ \begin{align} \text{cov}[\mathbf{x}] & \triangleq \mathbb{E} \left[\left(\mathbf{x} - \mathbb{E}[\mathbf{x}]\right)\left(\mathbf{x} - \mathbb{E}[\mathbf{x}]\right)^T\right]\\ & = \left( \begin{array}{cccc} \text{var}[X_1] & \text{cov}[X_1, X_2] & \ldots & \text{cov}[X_1, X_d] \\ \text{cov}[X_2, X_1] & \text{var}[X_2] & \ldots & \text{cov}[X_2, X_d] \\ \vdots & \vdots & \ddots & \vdots\\ \text{cov}[X_d, X_1] & \text{cov}[X_d, X_2] & \ldots & \text{var}[X_d] \end{array} \right) \end{align} $$

Covariances can be between $-\infty$ and $\infty$. Sometimes it is more convenient to work with a normalized measure, with finite bounds. The (Pearson) correlation coefficient between $X$ and $Y$ is defined as

$$\text{corr}[X,Y] \triangleq \frac{\text{cov}[X,Y]}{\sqrt{\text{var}[X]\text{var}[Y]}}$$

A correlation matrix has the form

$$ \mathbf{R} = \left( \begin{array}{cccc} \text{corr}[X_1, X_1] & \text{corr}[X_1, X_2] & \ldots & \text{corr}[X_1, X_d] \\ \text{corr}[X_2, X_1] & \text{corr}[X_2, X_2] & \ldots & \text{corr}[X_2, X_d] \\ \vdots & \vdots & \ddots & \vdots\\ \text{corr}[X_d, X_1] & \text{corr}[X_d, X_2] & \ldots & \text{corr}[X_d, X_d] \end{array} \right) $$

One can show that $-1 \leq \text{corr}[X,Y] \leq 1$. Hence, in a correlation matrix, each entry on the diagonal is 1, and the other entries are between -1 and 1. One can also show that $\text{corr}[X,Y]=1$ iff $Y=aX + b$ for some parameters $a > 0$ and $b$, i.e. there is a (positive) linear relationship between $X$ and $Y$. A good way to think of the correlation coefficient is as a degree of linearity.

If $X$ and $Y$ are independent, meaning $p(X,Y)=p(X)p(Y)$, then $\text{cov}[X,Y]=0$, and hence $\text{corr}[X,Y]=0$ so they are uncorrelated. However, the converse is not true: uncorrelated does not imply independent. Some striking examples are shown below.

Source: http://upload.wikimedia.org/wikipedia/commons/0/02/Correlation_examples.png
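
A concrete numeric instance of this: take $X$ symmetric about zero and $Y = X^2$. Then $Y$ is a deterministic function of $X$, yet their correlation is (approximately) zero:


In [ ]:
# Uncorrelated but not independent: Y = X**2 with X ~ N(0, 1)
X = stats.norm.rvs(size=100000, random_state=3)
Y = X ** 2
print(np.corrcoef(X, Y)[0, 1])    # close to 0 despite full dependence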

The multivariate Gaussian

The multivariate Gaussian or multivariate normal (MVN) is the most widely used joint probability density function for continuous variables. The pdf of the MVN in $D$ dimensions is defined by the following

$$\mathcal{N}(\mathbf{x}|\boldsymbol\mu,\mathbf{\Sigma}) \triangleq \frac{1}{(2 \pi)^{D/2}|\mathbf{\Sigma}|^{1/2}} \exp \left[ - \frac{1}{2} \left(\mathbf{x} - \boldsymbol\mu \right)^T \mathbf{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol\mu\right)\right]$$

where $\boldsymbol\mu = \mathbb{E}[\mathbf{x}] \in \mathbb{R}^D$ is the mean vector, and $\Sigma = \text{cov}[\mathbf{x}]$ is the $D \times D$ covariance matrix. Sometimes we will work in terms of the precision matrix or concentration matrix instead. This is just the inverse covariance matrix, $\Lambda = \Sigma^{-1}$. The normalization constant $(2 \pi)^{-D/2}|\Lambda|^{1/2}$ ensures that the pdf integrates to 1.

The figure below plots some MVN densities in 2d for three different kinds of covariance matrices. A full covariance matrix has $D(D+1)/2$ parameters (we divide by 2 since $\Sigma$ is symmetric). A diagonal covariance matrix has $D$ parameters, and has 0s on the off-diagonal terms. A spherical or isotropic covariance, $\Sigma = \sigma^2 \mathbf{I}_D$, has one free parameter.


In [ ]:
# plot a MVN in 2D and 3D
from scipy.linalg import eig
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

delta = 0.05
x = np.arange(-10.0, 10.0, delta)
y = np.arange(-10.0, 10.0, delta)
X, Y = np.meshgrid(x, y)
XY = np.dstack((X, Y))  # grid of (x, y) pairs for evaluating the 2D pdf

S = np.asarray([[2.0, 1.8],
                [1.8, 2.0]])
mu = np.asarray([0, 0])

Z = stats.multivariate_normal(mean=mu, cov=S).pdf(XY)

#fig, ax = plt.subplots(2, 2, figsize=(10, 10),
#                       subplot_kw={'aspect': 'equal'})

fig = plt.figure(figsize=(10, 10))

ax = fig.add_subplot(2, 2, 1)

CS = ax.contour(X, Y, Z)
plt.clabel(CS, inline=1, fontsize=10)

ax.set_xlim((-6, 6))
ax.set_ylim((-6, 6))
ax.set_title('full')

# Decorrelate
[D, U] = eig(S)

S1 = np.dot(np.dot(U.T, S), U)

Z = stats.multivariate_normal(mean=mu, cov=S1).pdf(XY)

ax = fig.add_subplot(2, 2, 2)
CS = ax.contour(X, Y, Z)
plt.clabel(CS, inline=1, fontsize=10)


ax.set_xlim((-10, 10))
ax.set_ylim((-5, 5))
ax.set_title('diagonal')

# Whiten
A = np.dot(np.sqrt(np.linalg.inv(np.diag(np.real(D)))), U.T)
mu2 = np.dot(A, mu)
S2 = np.dot(np.dot(A, S), A.T)  # may not be numerically equal to I


#np.testing.assert_allclose(S2, np.eye(2))  # check
print(np.allclose(S2, np.eye(2)))

# plot centred on original mu, not shifted mu
Z = stats.multivariate_normal(mean=mu, cov=S2).pdf(XY)

ax = fig.add_subplot(2, 2, 3)
CS = ax.contour(X, Y, Z)
plt.clabel(CS, inline=1, fontsize=10)

ax.set_xlim((-6, 6))
ax.set_ylim((-6, 6))
ax.set_title('spherical')

# demonstration of how to do a surface plot
axx = fig.add_subplot(2, 2, 4, projection='3d')
surf = axx.plot_surface(X, Y, Z, rstride=5, cstride=5, cmap=cm.coolwarm,
                        linewidth=0, antialiased=False)
axx.set_title('spherical')